multi-modal learning
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing in isolation either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches, which rely solely on either inter- or intra-modality dependencies, may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, where we consider the target as a source of multiple modalities and the interaction between them. To that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods focusing only on one type of modality dependency.
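One way to picture "capturing and integrating" both kinds of dependency is to let each unimodal model and one joint model emit per-class log-scores and then add them before normalizing. The sketch below is only an illustration under that assumption: the abstract does not state how I2M2 combines its components, and the names `combine_log_scores`, `image_scores`, `text_scores` and `joint_scores` are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of log-scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_log_scores(intra_scores, inter_scores):
    """Add per-class log-scores from the unimodal (intra-modality) models
    and the joint (inter-modality) model, then normalize.

    Assumption: additive combination in log space. This is an illustrative
    choice, not a rule taken from the abstract.
    """
    combined = list(inter_scores)
    for scores in intra_scores:
        if len(scores) != len(combined):
            raise ValueError("all models must score the same classes")
        combined = [c + s for c, s in zip(combined, scores)]
    return softmax(combined)


# Hypothetical two-class example with two modalities.
image_scores = [2.0, 0.0]   # intra-modality model on modality 1
text_scores = [0.0, 0.5]    # intra-modality model on modality 2
joint_scores = [0.0, 1.0]   # inter-modality model on both modalities
probs = combine_log_scores([image_scores, text_scores], joint_scores)
```

With these made-up scores the strong image evidence outweighs the weaker text and joint evidence, so the first class keeps the higher probability.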
What Makes Multi-Modal Learning Better than Single (Provably)
The world provides us with data of multiple modalities. Intuitively, models fusing data from different modalities outperform their uni-modal counterparts, since more information is aggregated. Recently, building on the success of deep learning, an influential line of work on deep multi-modal learning has achieved remarkable empirical results on various applications. However, theoretical justifications in this field are notably lacking. Can multi-modal learning provably perform better than uni-modal? In this paper, we answer this question under one of the most popular multi-modal fusion frameworks, which first encodes features from different modalities into a common latent space and then maps the latent representations into the task space. We prove that learning with multiple modalities achieves a smaller population risk than using only a subset of those modalities. The main intuition is that the former has a more accurate estimate of the latent space representation. To the best of our knowledge, this is the first theoretical treatment to capture important qualitative phenomena observed in real multi-modal applications from the generalization perspective. Combined with experimental results, we show that multi-modal learning does possess an appealing formal guarantee.
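The fusion framework described here has two stages: per-modality encoders into a shared latent space, followed by a map from that latent space to the task output. A minimal sketch follows, assuming linear encoders, averaging as the fusion step, and a linear task head. All three are illustrative choices that the abstract does not specify, and the names `matvec`, `fuse_and_predict`, `enc_a`, `enc_b` and `head` are hypothetical.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def fuse_and_predict(inputs, encoders, head):
    """Encode each modality into a common latent space, fuse, then predict.

    inputs:   one feature vector per modality (dimensions may differ)
    encoders: one matrix per modality, each mapping into the same latent size
    head:     vector mapping the fused latent to a scalar task output
    """
    latents = [matvec(w, x) for w, x in zip(encoders, inputs)]
    dim = len(latents[0])
    # Assumed fusion rule: element-wise average of the latent vectors.
    fused = [sum(z[i] for z in latents) / len(latents) for i in range(dim)]
    return sum(h * z for h, z in zip(head, fused))


# Hypothetical setup: modality A has 2 features, modality B has 3 features,
# and both are encoded into a 2-dimensional latent space.
enc_a = [[1.0, 0.0], [0.0, 1.0]]
enc_b = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
head = [2.0, -1.0]
prediction = fuse_and_predict([[1.0, 2.0], [1.0, 0.0, 1.0]], [enc_a, enc_b], head)
```

Dropping a modality in this sketch amounts to passing a shorter `inputs` and `encoders` list, which is the subset-of-modalities setting the abstract compares against.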
Mitigating Modality Imbalance in Multi-modal Learning via Multi-objective Optimization
Fernando, Heshan, Ram, Parikshit, Zhou, Yi, Dan, Soham, Samulowitz, Horst, Baracaldo, Nathalie, Chen, Tianyi
Multi-modal learning (MML) aims to integrate information from multiple modalities, which is expected to lead to superior performance over single-modality learning. However, recent studies have shown that MML can underperform, even compared to single-modality approaches, due to imbalanced learning across modalities. Methods have been proposed to alleviate this imbalance issue using different heuristics, which often lead to computationally intensive subroutines. In this paper, we reformulate the MML problem as a multi-objective optimization (MOO) problem that overcomes the imbalanced learning issue among modalities and propose a gradient-based algorithm to solve the modified MML problem. We provide convergence guarantees for the proposed method, and empirical evaluations on popular MML benchmarks showcase its improved performance over existing balanced MML and MOO baselines, with up to ~20x reduction in subroutine computation time. Our code is available at https://github.com/heshandevaka/MIMO.
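To make the multi-objective view concrete, the sketch below shows a generic two-objective min-norm rule (in the style of the multiple-gradient descent algorithm): it picks the convex combination of two gradients with the smallest norm, which gives a direction that does not increase either objective to first order. This is a standard MOO building block swapped in for illustration, not the algorithm proposed in the paper, and the name `min_norm_direction` is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1, g2):
    """Return the minimum-norm point alpha*g1 + (1-alpha)*g2, alpha in [0, 1].

    Closed form for two gradients:
        alpha = ((g2 - g1) . g2) / ||g1 - g2||^2, clipped to [0, 1].
    """
    diff = [b - a for a, b in zip(g1, g2)]          # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha = 0.5                                  # identical gradients
    else:
        alpha = min(1.0, max(0.0, dot(diff, g2) / denom))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


# Hypothetical per-modality loss gradients that pull in different directions.
grad_audio = [1.0, 0.0]
grad_video = [0.0, 1.0]
direction = min_norm_direction(grad_audio, grad_video)
```

When one gradient is much larger than the other, the clipping step returns the smaller gradient, which keeps a dominant modality from dictating the update.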
Export Reviews, Discussions, Author Feedback and Meta-Reviews
The most significant improvement was made for unimodal queries when we increased the iteration number from 0 to 1 (margin of improvement: 0.034), but we still observe further gains by increasing it to 5 and 10 (margins of improvement: 0.048 and 0.05). Based on your suggestion on multi-prediction training that approximates the joint likelihood, we evaluated the performance of the multimodal deep network trained jointly on $x$ and $y$ as in the original MP-DBM (i.e., randomly select subsets of variables from both data modalities $x$ and $y$ and predict them given the rest). In our preliminary results, the original MP-DBM style training jointly on $x$ and $y$ gave worse results than our proposed training scheme (i.e., predicting $x$ given $y$ and vice versa) for both multimodal and unimodal queries. We will include complete results in the revision. R38: Fine-tuning brings a significant improvement: before MDRNN fine-tuning, we obtained test set mAPs of 0.632 and 0.521 for multimodal and unimodal queries, respectively, and these numbers go up to 0.686 and 0.607 after MDRNN fine-tuning.
Multi-Modal Learning with Bayesian-Oriented Gradient Calibration
Guo, Peizheng, Wang, Jingyao, Guo, Huijie, Li, Jiangmeng, Sun, Chuxiong, Zheng, Changwen, Qiang, Wenwen
Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. However, existing methods mainly aggregate gradients with fixed weights and treat all dimensions equally, overlooking the intrinsic gradient uncertainty of each modality. This may lead to (i) excessive updates in sensitive dimensions, degrading performance, and (ii) insufficient updates in less sensitive dimensions, hindering learning. To address this issue, we propose BOGC-MML, a Bayesian-Oriented Gradient Calibration method for MML to explicitly model the gradient uncertainty and guide the model optimization towards the optimal direction. Specifically, we first model each modality's gradient as a random variable and derive its probability distribution, capturing the full uncertainty in the gradient space. Then, we propose an effective method that converts the precision (inverse variance) of each gradient distribution into a scalar evidence. This evidence quantifies the confidence of each modality in every gradient dimension. Using these evidences, we explicitly quantify per-dimension uncertainties and fuse them via a reduced Dempster-Shafer rule. The resulting uncertainty-weighted aggregation produces a calibrated update direction that balances sensitivity and conservatism across dimensions. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and advantages of the proposed method.
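The idea of weighting each gradient dimension by how certain a modality is about it can be sketched with plain inverse-variance weighting. The code below estimates a per-dimension mean and variance from a few gradient samples per modality and fuses the means with weights proportional to precision. This is a simplified stand-in: it does not implement the reduced Dempster-Shafer rule, whose details the abstract does not give, and the names `mean_and_variance` and `precision_weighted_gradient` are hypothetical.

```python
def mean_and_variance(samples):
    """Per-dimension mean and (population) variance of gradient samples."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(dim)]
    variances = [
        sum((s[i] - means[i]) ** 2 for s in samples) / n for i in range(dim)
    ]
    return means, variances


def precision_weighted_gradient(per_modality_samples, eps=1e-8):
    """Fuse modality gradients dimension by dimension.

    Each modality's weight in a dimension is its precision (1 / variance),
    normalized over modalities. eps avoids division by zero.
    """
    stats = [mean_and_variance(s) for s in per_modality_samples]
    dim = len(stats[0][0])
    fused = []
    for i in range(dim):
        precisions = [1.0 / (var[i] + eps) for _, var in stats]
        total = sum(precisions)
        fused.append(
            sum(p * mean[i] for p, (mean, _) in zip(precisions, stats)) / total
        )
    return fused


# Hypothetical gradient samples (e.g. from different mini-batches).
modality_a = [[1.0, 0.0], [1.0, 2.0]]   # stable in dim 0, noisy in dim 1
modality_b = [[0.0, 3.0], [2.0, 3.0]]   # noisy in dim 0, stable in dim 1
fused_grad = precision_weighted_gradient([modality_a, modality_b])
```

In this toy example the fused gradient follows modality A in the first dimension and modality B in the second, since each is the low-variance source there.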